595 research outputs found
Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis
Database theory and database practice are typically the domain of computer
scientists who adopt what may be termed an algorithmic perspective on their
data. This perspective is very different than the more statistical perspective
adopted by statisticians, scientific computers, machine learners, and other who
work on what may be broadly termed statistical data analysis. In this article,
I will address fundamental aspects of this algorithmic-statistical disconnect,
with an eye to bridging the gap between these two very different approaches. A
concept that lies at the heart of this disconnect is that of statistical
regularization, a notion that has to do with how robust is the output of an
algorithm to the noise properties of the input data. Although it is nearly
completely absent from computer science, which historically has taken the input
data as given and modeled algorithms discretely, regularization in one form or
another is central to nearly every application domain that applies algorithms
to noisy data. By using several case studies, I will illustrate, both
theoretically and empirically, the nonobvious fact that approximate
computation, in and of itself, can implicitly lead to statistical
regularization. This and other recent work suggests that, by exploiting in a
more principled way the statistical properties implicit in worst-case
algorithms, one can in many cases satisfy the bicriteria of having algorithms
that are scalable to very large-scale databases and that also have good
inferential or predictive properties.Comment: To appear in the Proceedings of the 2012 ACM Symposium on Principles
of Database Systems (PODS 2012
Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems. In this chapter, I will describe two recent
examples---one having to do with selecting good columns or features from a (DNA
Single Nucleotide Polymorphism) data matrix, and the other having to do with
selecting good clusters or communities from a data graph (representing a social
or information network)---that drew on ideas from both areas and that may serve
as a model for exploiting complementary algorithmic and statistical
perspectives in order to solve applied large-scale data analysis problems.Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors,
"Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
Revisiting the Nystrom Method for Improved Large-Scale Machine Learning
We reconsider randomized algorithms for the low-rank approximation of
symmetric positive semi-definite (SPSD) matrices such as Laplacian and kernel
matrices that arise in data analysis and machine learning applications. Our
main results consist of an empirical evaluation of the performance quality and
running time of sampling and projection methods on a diverse suite of SPSD
matrices. Our results highlight complementary aspects of sampling versus
projection methods; they characterize the effects of common data preprocessing
steps on the performance of these algorithms; and they point to important
differences between uniform sampling and nonuniform sampling methods based on
leverage scores. In addition, our empirical results illustrate that existing
theory is so weak that it does not provide even a qualitative guide to
practice. Thus, we complement our empirical results with a suite of worst-case
theoretical bounds for both random sampling and random projection methods.
These bounds are qualitatively superior to existing bounds---e.g. improved
additive-error bounds for spectral and Frobenius norm error and relative-error
bounds for trace norm error---and they point to future directions to make these
algorithms useful in even larger-scale machine learning applications.Comment: 60 pages, 15 color figures; updated proof of Frobenius norm bounds,
added comparison to projection-based low-rank approximations, and an analysis
of the power method applied to SPSD sketche
Low-distortion Subspace Embeddings in Input-sparsity Time and Applications to Robust Linear Regression
Low-distortion embeddings are critical building blocks for developing random
sampling and random projection algorithms for linear algebra problems. We show
that, given a matrix with and a , with a constant probability, we can construct a low-distortion embedding
matrix \Pi \in \R^{O(\poly(d)) \times n} that embeds \A_p, the
subspace spanned by 's columns, into (\R^{O(\poly(d))}, \| \cdot \|_p);
the distortion of our embeddings is only O(\poly(d)), and we can compute in O(\nnz(A)) time, i.e., input-sparsity time. Our result generalizes the
input-sparsity time subspace embedding by Clarkson and Woodruff
[STOC'13]; and for completeness, we present a simpler and improved analysis of
their construction for . These input-sparsity time embeddings
are optimal, up to constants, in terms of their running time; and the improved
running time propagates to applications such as -distortion
subspace embedding and relative-error regression. For
, we show that a -approximate solution to the
regression problem specified by the matrix and a vector can be
computed in O(\nnz(A) + d^3 \log(d/\epsilon) /\epsilon^2) time; and for
, via a subspace-preserving sampling procedure, we show that a -distortion embedding of \A_p into \R^{O(\poly(d))} can be
computed in O(\nnz(A) \cdot \log n) time, and we also show that a
-approximate solution to the regression problem can be computed in O(\nnz(A) \cdot \log n + \poly(d)
\log(1/\epsilon)/\epsilon^2) time. Moreover, we can improve the embedding
dimension or equivalently the sample size to without increasing the complexity.Comment: 22 page
Lectures on Randomized Numerical Linear Algebra
This chapter is based on lectures on Randomized Numerical Linear Algebra from
the 2016 Park City Mathematics Institute summer school on The Mathematics of
Data.Comment: To appear in the edited volume of lectures from the 2016 PCMI summer
schoo
- β¦